delta-Clusters: Capturing Subspace Correlation in a Large Data Set

نویسندگان

  • Jiong Yang
  • Wei Wang
  • Haixun Wang
  • Philip S. Yu
چکیده

Clustering has been an active research area of great practical importance for recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace cluster) and assumed that every object has an associated value on every dimension (e.g., bicluster). These existing cluster models may not always be adequate in capturing coherence exhibited among objects. Strong coherence may still exist among a set of objects (on a subset of attributes) even if they take quite different values on each attribute and the attribute values are not fully specified. This is very common in many applications including bio-informatics analysis as well as collaborative filtering analysis, where the data may be incomplete and subject to biases. In bio-informatics, a bicluster model has recently been proposed to capture coherence among a subset of the attributes. Here, we introduce a more general model, referred to as the -cluster model, to capture coherence exhibited by a subset of objects on a subset of attributes, while allowing absent attribute values. A move-based algorithm (FLOC) is devised to efficiently produce a near-optimal clustering results. The -cluster model takes the bicluster model as a special case, where the FLOC algorithm performs far superior to the bicluster algorithm. We demonstrate the correctness and efficiency of the -cluster model and the FLOC algorithm on a number of real and synthetic data sets.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Fast Algorithm for Finding Correlation Clusters in Noise Data

Noise significantly affects cluster quality. Conventional clustering methods hardly detect clusters in a data set containing a large amount of noise. Projected clustering sheds light on identifying correlation clusters in such a data set. In order to exclude noise points which are usually scattered in a subspace, data points are projected to form dense areas in the subspace that are regarded as...

متن کامل

-Clusters: Capturing Subspace Correlation in a Large Data Set

Clustering has been an active research area of great practical importance for recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace cluster) and assumed that every object has an associated value on every dimension (e.g., bicluster). These existing cluster models may not always be adequate in capturing coher...

متن کامل

A TWO-STAGE METHOD FOR DAMAGE DETECTION OF LARGE-SCALE STRUCTURES

A novel two-stage algorithm for detection of damages in large-scale structures under static loads is presented. The technique utilizes the vector of response change (VRC) and sensitivities of responses with respect to the elemental damage parameters (RSEs). It is shown that VRC approximately lies in the subspace spanned by RSEs corresponding to the damaged elements. The property is leveraged in...

متن کامل

خوشه‌بندی داده‌ها بر پایه شناسایی کلید

Clustering has been one of the main building blocks in the fields of machine learning and computer vision. Given a pair-wise distance measure, it is challenging to find a proper way to identify a subset of representative exemplars and its associated cluster structures. Recent trend on big data analysis poses a more demanding requirement on new clustering algorithm to be both scalable and accura...

متن کامل

Data Clustring Using A New CGA(Chaotic-Generic Algorithm) Approach

Clustering is the process of dividing a set of input data into a number of subgroups. The members of each subgroup are similar to each other but different from members of other subgroups. The genetic algorithm has enjoyed many applications in clustering data. One of these applications is the clustering of images. The problem with the earlier methods used in clustering images was in selecting in...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002